applications of each of the techniques are discussed.
is discussed and sample jobs are included.
applied to a collection of data on tanks consisting of
twenty-four observations of ten attributes of tanks. The
cluster analysis shows that the tanks cluster somewhat
naturally by nationality. The principal components analysis
and the discriminant analysis show that tank weight is the
single most important discriminator among nationalities.

I. DISCUSSION OF MULTIVARIATE DATA ANALYSIS


A. INTRODUCTION 

As a set of statistical techniques, multivariate data
analysis is concerned with data collected on several
dimensions of the same observations. The techniques can be
used for many purposes in the behavioral, mathematical, and
administrative sciences - ranging from rigidly controlled
experiments to explain relationships assumed to be present in
a large mass of data, to attempts to cluster similar elements
or to find functions of the variables that will best
discriminate among preselected subpopulations of the
observations.

The heart of any multivariate analysis consists of the
data matrix. This matrix is a table that gives the results
of a number of observations on a number of variables
simultaneously (Table I).





TABLE I. Illustrative Data Matrix

                               Variables
Observations     1       2       3     ...     j     ...     p
     1         x_11    x_12    x_13    ...   x_1j    ...   x_1p
     2         x_21    x_22    x_23    ...   x_2j    ...   x_2p
     .           .       .       .             .             .
     i         x_i1    x_i2    x_i3    ...   x_ij    ...   x_ip
     .           .       .       .             .             .
     n         x_n1    x_n2    x_n3    ...   x_nj    ...   x_np


The table consists of a set of observations (the n
rows) and a set of measurements on those observations (the
p columns). Cell entries represent the value x_ij of
observation i on variable j. The values are
characteristics of the observations and serve to define the
observations in any specific study. The cell values may
consist of nominal, ordinal, interval, or ratio-scaled
measurements, or various combinations of these across columns.





In a general sense, "multivariate" analysis concerns
two main features:
1. The multivariate character lies in the
multiplicity of the p variables, not
in the size n of the set of observations.
2. The variables are dependent among
themselves so that we cannot split off one
or more from the others and consider it
by itself. The variables must be
considered together.
There are three characteristics often used as a basis
for the classification of multivariate analysis:
1. whether one's principal focus is on the
objects or on the variables of the data
matrix;
2. whether the data matrix is partitioned
into criterion and independent subsets,
and the number of variables in each;
3. whether the cell values represent
nominal, ordinal, or interval scale
measurements.
This classification results in four major subdivisions of
interest:

1. single criterion, multiple predictor
association, including multiple regression,
analysis of variance and covariance, and
two-group discriminant analysis;

2. multiple criterion, multiple predictor
association, including canonical correlation,
multivariate analysis of variance
and covariance, and multiple discriminant
analysis;

3. analysis of variable interdependence,
including factor analysis, multidimensional
scaling, and other types of
dimension-reducing methods;

4. analysis of interobject similarity,
including cluster analysis and other
types of grouping procedures.

The first two categories involve dependence structures 
where the data matrix is partitioned into criterion and 
independent subsets; in both cases interest is focused on 
the variables. The last two categories are concerned with 
interdependence - either focusing on variables or on 
observations. Within each of the four categories, various
techniques are differentiated in terms of the type of scale
assumed.

In this research, we consider only the following
techniques of multivariate analysis:

1. Principal components analysis

2. Discriminant analysis

3. Cluster analysis

II. PRINCIPAL COMPONENTS ANALYSIS


The basic idea of principal components analysis is to
describe the dispersion of an array of n points in
p-dimensional space by introducing a new set of orthogonal
linear coordinates so that the sample variances of the
given data points with respect to these derived coordinates
are in decreasing order of magnitude. Thus the first
principal component is such that the projections of the given
points onto it have maximum variance among all possible
linear coordinates; the second principal component has
maximum variance subject to being orthogonal to the first;
and so on.

Suppose that the random variables $X_1, X_2, \ldots, X_p$
of interest have a certain multivariate distribution
with finite mean vector $\mu$ and variance-covariance matrix
$\Sigma$.

From this population a sample of n independent
observation vectors has been drawn. The observations can
be written as the usual n x p data matrix

$$X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} = \begin{bmatrix} x_1' \\ \vdots \\ x_n' \end{bmatrix} \tag{1}$$

The estimate of $\Sigma$ will be the usual sample
variance-covariance matrix S defined as follows:

$$S = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})' \tag{2}$$

where $\bar{x}$ is the sample mean vector.


The information we shall need for our principal
components analysis will be contained in S. However, it will
be necessary to make a choice of measures of dependence:
should we work with the variances and covariances of the
observations, and carry out the analysis in the original units
of the responses, or would a more accurate picture of the
dependence pattern be obtained if each $x_{ij}$ were
transformed to a standardized score

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

and the correlation matrix R employed? The components
obtained from S and R are in general not the same, nor is
it possible to pass from one solution to the other by a
simple scaling of the coefficients.
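The choice between S and R can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available and using made-up values for three responses in different units (crew, weight in tons, speed in km/h); the divisor n - 1 is assumed for the sample covariance, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: 5 observations on 3 responses in very different units
# (crew count, weight in tons, speed in km/h). The values are made up.
X = np.array([
    [4.0, 36.0, 48.0],
    [4.0, 52.0, 50.0],
    [3.0, 41.0, 65.0],
    [4.0, 55.0, 45.0],
    [3.0, 38.0, 60.0],
])

n = X.shape[0]
centered = X - X.mean(axis=0)

# Sample variance-covariance matrix S (divisor n - 1 is an assumption).
S = centered.T @ centered / (n - 1)

# Standardized scores z_ij = (x_ij - xbar_j) / s_j; their covariance
# matrix is the correlation matrix R.
s = np.sqrt(np.diag(S))
Z = centered / s
R = Z.T @ Z / (n - 1)

# Leading eigenvectors of S and of R (eigh returns ascending eigenvalues,
# so the last column belongs to the greatest eigenvalue).
a_S = np.linalg.eigh(S)[1][:, -1]
a_R = np.linalg.eigh(R)[1][:, -1]
print("first component from S:", a_S)
print("first component from R:", a_R)
```

With responses on such different scales, the component from S tends to be dominated by the variable with the largest variance, which is the situation the text warns about.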

If the responses are in widely different units (e.g.,
number of crew, weight in tons, speed in kilometers per
hour, etc.) with large differences in their magnitudes,
linear compounds of the original quantities would have little
meaning, and standardized variates and the correlation matrix
should be employed. Conversely, if the responses are
reasonably commensurable, the covariance form has a greater
statistical appeal, for the i-th principal component is
that linear compound of the responses which explains the
i-th largest portion of the total response variance, and
maximization of such total variance of standard scores is
rather artificial.

The first principal component of the complex of sample
values of the responses $X_1, X_2, \ldots, X_p$ is the
linear compound

$$Y_1 = a_{11}X_1 + a_{21}X_2 + \cdots + a_{p1}X_p \tag{3}$$

whose coefficients $a_{i1}$ are the elements of the eigenvector
associated with the greatest eigenvalue $\lambda_1$ of the sample
variance-covariance matrix of the responses. The $a_{i1}$
are unique up to multiplication by a scale factor, and if
they are scaled so that $A_1'A_1 = 1$, the eigenvalue $\lambda_1$ is
interpretable as the sample variance of $Y_1$.

Numerically, finding the first principal component
amounts to finding the vector $A_1 = (a_{11}, \ldots, a_{p1})'$ such that

$$Y_1 = a_{11}X_1 + \cdots + a_{p1}X_p \tag{4}$$

maximizes the sample variance

$$s_{Y_1}^2 = \sum_{i=1}^{p}\sum_{j=1}^{p} a_{i1}a_{j1}s_{ij} = A_1'SA_1 \tag{5}$$

over all coefficient vectors normalized so that $A_1'A_1 = 1$.


To determine the coefficients, the normalization constraint
is introduced by means of a Lagrange multiplier $\lambda_1$ and the
resulting expression is differentiated with respect to $A_1$:

$$\frac{\partial}{\partial A_1}\left[s_{Y_1}^2 + \lambda_1(1 - A_1'A_1)\right] = \frac{\partial}{\partial A_1}\left[A_1'SA_1 + \lambda_1(1 - A_1'A_1)\right] = 2(S - \lambda_1 I)A_1 \tag{6}$$

The coefficients must satisfy the p simultaneous linear
equations

$$(S - \lambda_1 I)A_1 = 0 \tag{7}$$

If the solution to these equations is to be other than the
null vector, the value of $\lambda_1$ must be chosen so that

$$|S - \lambda_1 I| = 0 \tag{8}$$


$\lambda_1$ is thus an eigenvalue of the variance-covariance matrix,
and $A_1$ is its associated eigenvector. To determine which
of the p eigenvalues should be used, premultiply the
system of equations (7) by $A_1'$. Since $A_1'A_1 = 1$, it
follows that

$$\lambda_1 = A_1'SA_1 = s_{Y_1}^2$$

But the coefficient vector was chosen to maximize this
variance, and therefore $\lambda_1$ must be the greatest
eigenvalue of S.
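The result that the first component is the eigenvector of S with the greatest eigenvalue, and that this eigenvalue equals the sample variance of $Y_1$, can be checked numerically. The sketch below assumes NumPy and uses synthetic data; the names `A1`, `lam1`, and `var_Y1` are illustrative.

```python
import numpy as np

# Synthetic correlated data: 50 observations on 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

S = np.cov(X, rowvar=False)            # sample covariance, divisor n - 1

# eigh handles symmetric matrices and returns eigenvalues in ascending
# order with unit-length eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(S)
lam1 = eigvals[-1]                     # greatest eigenvalue
A1 = eigvecs[:, -1]                    # its eigenvector, with A1'A1 = 1

# Scores on the first component and their sample variance.
Y1 = (X - X.mean(axis=0)) @ A1
var_Y1 = Y1.var(ddof=1)

print("lambda_1      :", lam1)
print("variance of Y1:", var_Y1)
```

Any other unit-length coefficient vector b gives a variance b'Sb no greater than `lam1`, which is the maximization property stated in the text.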


The second principal component is that linear compound

$$Y_2 = a_{12}X_1 + a_{22}X_2 + \cdots + a_{p2}X_p \tag{9}$$

whose coefficients have been chosen, subject to the
constraints

$$A_2'A_2 = 1, \qquad A_1'A_2 = 0 \tag{10}$$

so that the variance of $Y_2$, $A_2'SA_2$, is a maximum.
The first constraint is merely a scaling to assure the
uniqueness of the coefficients, while the second requires
that $A_1$ and $A_2$ be orthogonal.
The coefficients of the second component can also be
found by the Lagrangian technique with two multipliers $\lambda_2$
and $\mu$. Differentiating the Lagrangian with respect to $A_2$ gives:

$$\frac{\partial}{\partial A_2}\left[A_2'SA_2 + \lambda_2(1 - A_2'A_2) + \mu A_1'A_2\right] = 2(S - \lambda_2 I)A_2 + \mu A_1 \tag{11}$$

If the right-hand side is set equal to 0 and premultiplied
by $A_1'$, it follows from the normalization and orthogonality
conditions that

$$2A_1'SA_2 + \mu = 0 \tag{12}$$

Similar premultiplication of equation (7) by $A_2'$
implies that

$$A_1'SA_2 = 0 \tag{13}$$

and hence $\mu = 0$. The second vector must satisfy

$$(S - \lambda_2 I)A_2 = 0 \tag{14}$$

It follows that the coefficients of the second component
are the elements of the eigenvector corresponding to
the second greatest eigenvalue. The remaining principal
components are found in their turn in the same manner from
the other eigenvectors.

Thus the j-th principal component of the sample of
p-variate observations is the linear compound

$$Y_j = a_{1j}X_1 + a_{2j}X_2 + \cdots + a_{pj}X_p \tag{15}$$

whose coefficients are the elements of the eigenvector of
the sample variance-covariance matrix S corresponding to
the j-th largest eigenvalue $\lambda_j$. If $\lambda_i \neq \lambda_j$, the
coefficients of the i-th and j-th components are
necessarily orthogonal; if $\lambda_i = \lambda_j$, the elements can be
chosen to be orthogonal, although an infinity of such
orthogonal vectors exists. The sample variance of the
j-th component is $\lambda_j$, and the total system variance is
thus

$$\lambda_1 + \lambda_2 + \cdots + \lambda_p = \operatorname{tr} S \tag{16}$$

The importance of the j-th component in a more parsimonious
description of the system is measured by

$$\frac{\lambda_j}{\operatorname{tr} S} \tag{17}$$

which gives the fraction of the total variance contributed
by the j-th component.
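As a rough illustration of the full decomposition, the sketch below (assuming NumPy, with synthetic data and a hypothetical helper `principal_components`) orders the eigenvalues, confirms that they sum to tr S, and computes the fraction of total variance attributable to each component.

```python
import numpy as np

def principal_components(X):
    """Return eigenvalues (descending), eigenvectors (columns), and the
    fraction of total variance lambda_j / tr S for each component."""
    S = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort descending
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    proportion = eigvals / np.trace(S)
    return eigvals, eigvecs, proportion

# Synthetic data: 24 observations on 5 variables with unequal spreads.
rng = np.random.default_rng(1)
X = rng.normal(size=(24, 5)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])

lam, A, share = principal_components(X)
scores = (X - X.mean(axis=0)) @ A              # component scores Y_1..Y_p

for j, (l, s) in enumerate(zip(lam, share), start=1):
    print(f"component {j}: variance {l:.3f}, share {s:.3f}")
```

The columns of `A` are mutually orthogonal, and the sample variance of each column of `scores` equals the corresponding eigenvalue.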
III. DISCRIMINANT ANALYSIS


A. INTRODUCTION

The basic idea of discriminant analysis consists of
assigning an individual or a group of individuals to one
of several known or unknown distinct populations, on the
basis of observations on several characters of the
individual or group and a sample of observations on these
characters from the populations if these are unknown.

Fisher (1936) was the first to suggest a linear
function of variables representing different characters,
hereafter called the linear discriminant function
(discriminator), for classifying an individual into one of two
populations. Later research extended the analysis to
classification into one of k populations.

For the univariate case Fisher suggested a rule which
classifies an observation x into the i-th univariate
population if

$$|x - \bar{x}_i| = \min\left(|x - \bar{x}_1|, |x - \bar{x}_2|\right), \quad i = 1, 2 \tag{18}$$

where $\bar{x}_i$ is the sample mean based on a sample of size $N_i$
from the i-th population. For two p-variate populations
$\pi_1$ and $\pi_2$ (with the same covariance matrix) Fisher
replaced the vector random variable by an optimum linear
combination of its components obtained by maximizing the
ratio of the difference of the expected values of a linear
combination under $\pi_1$ and $\pi_2$ to its standard deviation.
He then used his univariate discrimination method with this
optimum linear combination of components as the random
variable.
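The univariate rule assigns an observation to the population whose sample mean is nearest. A minimal sketch in plain Python follows; the function name `classify_nearest_mean`, the tie-breaking convention, and the sample values are assumptions made for illustration.

```python
def classify_nearest_mean(x, sample_means):
    """Return the index i that minimizes |x - mean_i|.

    Ties are resolved in favor of the lowest index, a convention chosen
    here for illustration; the source text does not specify one.
    """
    distances = [abs(x - m) for m in sample_means]
    return distances.index(min(distances))

# Hypothetical samples from two univariate populations.
sample_1 = [2.0, 3.0, 4.0]
sample_2 = [8.0, 9.0, 10.0]
means = [sum(s) / len(s) for s in (sample_1, sample_2)]

for x in (4.5, 7.0):
    print(x, "-> population", classify_nearest_mean(x, means) + 1)
```

In the two-group multivariate case the same rule would be applied to the scalar value of the linear discriminant function rather than to a raw measurement.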

Rao (1948) considered the problem of classifying
people into one of three populations, castes of India. He
assumed that each of the three populations could be
characterized by four variables - stature ($x_1$), sitting
height ($x_2$), nasal depth ($x_3$), and nasal height ($x_4$) - of
each member of the population. On the basis of sample
observations on these characters from the three populations
the problem is to classify an individual with observation
$x = (x_1, x_2, x_3, x_4)$ into one of the three populations. He used
a linear discriminator to obtain the solution.


B. THEORY
In general, the underlying assumptions of discriminant
analysis are:
1. the groups being investigated are discrete
and identifiable;
2. each observation in each group can be
described by a set of measurements on p
characteristics or variables;
3. these p variables are assumed to have
a multivariate normal distribution in each
population.

The purposes of discriminant analysis are:

1. to test for mean group differences and to
describe the overlaps among groups;

2. to construct classification schemes based
upon the set of p variables in order to
assign previously unclassified observations
to the appropriate groups.

Hence, the problem of studying the direction of group
differences is, equivalently, a problem of finding a linear
combination of the original independent variables that
shows large differences in group means. In short,
discriminant analysis is a method for determining such linear
combinations.

The first step toward determining a linear combination
of a set of variables such that several group means on this
linear combination will differ widely among themselves is
to decide on a criterion for measuring such group-mean
differences. Once a linear combination has been constructed,
there is just a single transformed variable.
Hence, the F-ratio for testing the significance of the
overall difference among several group means on a single
variable suggests an appropriate criterion:

$$\lambda = \frac{v'Bv}{v'Wv} \tag{19}$$

where $v' = (v_1, v_2, \ldots, v_p)$ is a set of weights which
maximizes $\lambda$, and

$$B = \sum_{i=1}^{G} n_i(\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})', \qquad W = \sum_{i=1}^{G}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'$$

are the between-groups and within-groups SSCP (sums of
squares and cross products) matrices. Here

$\bar{x}_i$ is the mean vector of the i-th group.

$x_{ij}$ is the j-th observation vector in the i-th
group.

$\bar{x}$ is the grand mean vector of the data.

G is the number of groups.

$n_i$ is the number of observations in the i-th group.

Prime notation indicates transpose.
This ratio $\lambda$, called the discriminant criterion, was
originally proposed by Fisher in connection with his
two-group discriminant function. Once a criterion for group
differentiation has been determined, a set of weights
$(v_1, v_2, \ldots, v_p)$ which maximizes this criterion
should be determined. This is accomplished by taking the
partial derivative of $\lambda$ with respect to each component
$v_i$ of v and setting the result equal to zero:

$$\frac{\partial\lambda}{\partial v} = \frac{2\left[(Bv)(v'Wv) - (v'Bv)(Wv)\right]}{(v'Wv)^2} = \frac{2\left[Bv - \lambda Wv\right]}{v'Wv} = 0 \tag{20}$$

which is equivalent to

$$(B - \lambda W)v = 0 \quad\text{or}\quad (W^{-1}B - \lambda I)v = 0 \tag{21}$$

This equation is of the form

$$(A - \lambda I)v = 0 \tag{22}$$

Its solution, yielding the eigenvalues $\lambda_m$ and associated
eigenvectors $v_m$ of the matrix $A = W^{-1}B$, is therefore obtained
as in the principal components analysis, and solving this
eigenvalue problem solves the problem of maximizing the discriminant
criterion.
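The eigenvalue formulation can be sketched in a few lines. The code below assumes NumPy, synthetic groups, and the usual definitions of the between-groups and within-groups SSCP matrices B and W; the helper names `sscp_matrices` and `discriminant_functions` are hypothetical.

```python
import numpy as np

def sscp_matrices(groups):
    """Return (B, W): between- and within-groups SSCP matrices."""
    all_obs = np.vstack(groups)
    grand_mean = all_obs.mean(axis=0)
    p = all_obs.shape[1]
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in groups:
        mean_g = g.mean(axis=0)
        d = (mean_g - grand_mean)[:, None]
        B += g.shape[0] * (d @ d.T)        # n_i (xbar_i - xbar)(xbar_i - xbar)'
        c = g - mean_g
        W += c.T @ c                       # sum over j of deviations' outer products
    return B, W

def discriminant_functions(groups):
    """Solve (W^{-1}B - lambda I)v = 0; return eigenvalues in descending order."""
    B, W = sscp_matrices(groups)
    # W^{-1}B is not symmetric, so the general solver is used. Its
    # eigenvalues are real in theory; any tiny imaginary parts are dropped.
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    vals, vecs = vals.real, vecs.real
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order], B, W

# Synthetic example: G = 3 groups, p = 3 variables, 8 observations each.
rng = np.random.default_rng(2)
groups = [rng.normal(loc=m, size=(8, 3)) for m in ([0, 0, 0], [2, 0, 1], [0, 3, 1])]
lam, V, B, W = discriminant_functions(groups)

v1 = V[:, 0]
criterion = (v1 @ B @ v1) / (v1 @ W @ v1)
print("largest eigenvalue:", lam[0], " criterion at v1:", criterion)
```

With three groups and three variables, only two eigenvalues are nonzero apart from rounding error.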

In the last equation, the number of non-zero eigenvalues
of a square matrix A is equal to the rank of A. With
$W^{-1}B$ playing the role of A, the number of non-zero
eigenvalues depends on the rank of B, since the rank of the
product of two matrices cannot exceed the smaller of the two
factor matrices' ranks, and $W^{-1}$ (being nonsingular) must be
of full rank p, while the rank of B is usually smaller
than p. Thus it is possible to denote the rank of B by
r = min(G-1, p).

From the fact that the eigenvalues $\lambda_m$ are the values
assumed by the discriminant criterion for linear combinations
using the elements of the corresponding eigenvectors $v_m$
as combining weights, it is clear that the eigenvector
$v_1' = (v_{11}, v_{12}, \ldots, v_{1p})$ provides a set of weights such
that the transformed variable

$$Y_1 = v_{11}X_1 + v_{12}X_2 + \cdots + v_{1p}X_p \tag{23}$$

has the largest discriminant-criterion value, $\lambda_1$, achievable by
any linear combination of the p independent variables.


What are the properties of the remaining eigenvectors,
$v_2, v_3, \ldots, v_r$? The second discriminant function
$Y_2 = v_{21}X_1 + v_{22}X_2 + \cdots + v_{2p}X_p$, whose weights are the
elements of the eigenvector $v_2$ associated with the second
largest eigenvalue $\lambda_2$ of $W^{-1}B$, has the largest
discriminant-criterion value among those linear combinations of
the $X_i$ that are uncorrelated with the first discriminant
function in the total sample. The proof is
analogous to that for principal components analysis: each
discriminant function has a relative (or conditional)
maximum value for its discriminant criterion. Therefore, it
only remains to show that $Y_2$ is uncorrelated with $Y_1$.
Noting that this correlation is proportional to $v_1'Tv_2$
(where T = W + B), we have to prove that $v_1'Tv_2 = 0$. From

$$(B - \lambda_i W)v_i = 0 \quad \text{for each } i \tag{24}$$

we have

$$Bv_1 = \lambda_1 Wv_1 \quad\text{and}\quad Bv_2 = \lambda_2 Wv_2$$

Premultiplying these equations by $v_2'$ and $v_1'$ respectively,

$$v_2'Bv_1 = \lambda_1 v_2'Wv_1, \qquad v_1'Bv_2 = \lambda_2 v_1'Wv_2 \tag{25}$$

Taking the transpose of both sides of the first equation (B
and W are symmetric),

$$v_1'Bv_2 = \lambda_1 v_1'Wv_2$$

thus

$$\lambda_1 v_1'Wv_2 = \lambda_2 v_1'Wv_2, \qquad (\lambda_1 - \lambda_2)v_1'Wv_2 = 0$$

Since $\lambda_1 \neq \lambda_2$, $v_1'Wv_2 = 0$, and then
$v_1'Bv_2 = \lambda_2 v_1'Wv_2 = 0$ as well. Therefore
$v_1'Tv_2 = v_1'Wv_2 + v_1'Bv_2 = 0$, which means $Y_1$ and $Y_2$ are
uncorrelated, and $Y_2$ has this property: its discriminant-
criterion value, $\lambda_2$, is the largest achievable by any
linear combination of the X's that is uncorrelated (in the
total sample) with $Y_1$. Similarly

$$Y_3 = v_{31}X_1 + v_{32}X_2 + \cdots + v_{3p}X_p \tag{26}$$

has the largest possible discriminant-criterion value ($\lambda_3$)
among all linear combinations of the X's that are uncorrelated
with both $Y_1$ and $Y_2$; and so on until $Y_r$, using the
elements of $v_r$ as weights, has the largest possible
discriminant-criterion value among linear combinations that
are uncorrelated with all the preceding linear combinations
$Y_1, Y_2, \ldots, Y_{r-1}$. The linear combinations $Y_1, Y_2, \ldots, Y_r$
are called the first, second, ..., r-th
discriminant functions for optimally differentiating among
the G given groups.
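The uncorrelatedness of successive discriminant functions can be verified numerically. The sketch below assumes NumPy and synthetic groups. It solves the eigenproblem through a Cholesky factor of W, which is an implementation choice made here (not a method described in the text) because it yields eigenvectors normalized so that V'WV = I.

```python
import numpy as np

# Synthetic example: G = 4 groups, p = 3 variables, 10 observations each.
rng = np.random.default_rng(3)
centers = ([0, 0, 0], [2, 1, 0], [0, 2, 3], [3, 3, 1])
groups = [rng.normal(loc=m, size=(10, 3)) for m in centers]

grand_mean = np.vstack(groups).mean(axis=0)
B = sum(len(g) * np.outer(g.mean(axis=0) - grand_mean,
                          g.mean(axis=0) - grand_mean) for g in groups)
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
T = W + B                                   # total SSCP matrix

# With W = L L', the symmetric matrix L^{-1} B L^{-T} has the same
# eigenvalues as W^{-1}B, and v = L^{-T} u maps its orthonormal
# eigenvectors u back to discriminant weights.
L = np.linalg.cholesky(W)
L_inv = np.linalg.inv(L)
lam, U = np.linalg.eigh(L_inv @ B @ L_inv.T)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]
V = L_inv.T @ U

v1, v2 = V[:, 0], V[:, 1]
cross_W = v1 @ W @ v2                       # should vanish
cross_T = v1 @ T @ v2                       # should vanish as well
print("v1'Wv2 =", cross_W, " v1'Tv2 =", cross_T)
```

Both cross products are zero up to rounding error, which is the numerical counterpart of the argument above.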

The situation here is reminiscent of principal
components analysis. There, the dimension corresponding to the
first component had maximum variance; the second-component
dimension had maximum variance among those uncorrelated with
the first; and so on. In discriminant analysis, the ratio
of between- to within-groups sums-of-squares merely takes the
place of variance as the criterion in determining the
successive dimensions. However, an important difference
between the dimensions identified in discriminant analysis
and those in component analysis is that the former are
generally not mutually orthogonal in the test space, even
though they are uncorrelated. That is, the axes representing
the discriminant functions are not a subset of axes obtainable
by rigid rotation of the original system of p axes; the
discriminant rotation is an oblique rotation.

Just as in the principal components analysis, the
dimensions represented by the discriminant functions may be
interpreted meaningfully. Even if they are not, it may be
possible to achieve parsimony by reducing the dimensionality
of the space needed to describe group differences. In
seeking to interpret the discriminant functions, the goal is
to determine which of the original p variables contribute
most to each function. For this purpose, comparison of the
relative magnitudes of the combining weights as given by the
elements of each eigenvector of $W^{-1}B$ is inappropriate
because these are weights to be applied to the variables in
raw-score scales, and are hence affected by the particular
unit used for each variable.

To eliminate the spurious effects of units of
measurement on the magnitudes of combining weights, standardized
variables should be used.

The relative magnitudes of these standardized weights may
be assessed by multiplying each raw-score weight by the
standard deviation of the corresponding variable as computed
from the within-groups SSCP (Sum of Squares, Cross Product)
matrix. This amounts to multiplying each element of a given
eigenvector $v_m$ by the square root of the corresponding
diagonal element of W. Thus, for each m, define

$$v_{mi}^{*} = \sqrt{w_{ii}}\, v_{mi}, \quad i = 1, 2, \ldots, p \tag{27}$$

as the standardized discriminant weights. The relative
contribution of the i-th variable to the m-th discriminant
function may then be gauged by the magnitude of $v_{mi}^{*}$ in
comparison with the other weights $v_{mj}^{*}$.
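The standardization of weights is a single scaling step. A small sketch follows, assuming NumPy; the matrices `W` and `V` hold made-up values, and rows of `V` correspond to variables while columns correspond to discriminant functions (a layout chosen here for illustration).

```python
import numpy as np

def standardized_weights(V, W):
    """Multiply each raw weight by sqrt(w_ii) of its variable.

    V: p x r matrix whose columns are raw-score eigenvectors.
    W: p x p within-groups SSCP matrix.
    """
    return np.sqrt(np.diag(W))[:, None] * V

# Hypothetical values: variable 2 has a much larger within-groups spread.
W = np.array([[4.0, 1.0],
              [1.0, 100.0]])
V = np.array([[0.5, -0.2],
              [0.1, 0.3]])

V_std = standardized_weights(V, W)
print(V_std)
```

In this made-up case the raw weights 0.5 and 0.1 on the first function suggest that variable 1 dominates, whereas the standardized weights are equal, so the apparent dominance was only an effect of the measurement units.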
Up to this point, it has been shown that the
dimensionality of the discriminant space is equal to the number of
nonzero eigenvalues of $W^{-1}B$, which is the smaller of the
two numbers G-1 and p. It may often happen that the
number of significant discriminant dimensions is even
smaller. That is, not all of the discriminant functions may
represent dimensions along which statistically significant
group differences occur.


C. SIGNIFICANCE TEST IN DISCRIMINANT ANALYSIS

A basic quantity in testing the significance of the
overall difference among several group centroids (mean
vectors) is the ratio of the determinants of the within-
groups and the total SSCP matrices, known as Wilks' $\Lambda$
criterion:

$$\Lambda = \frac{|W|}{|T|} \tag{28}$$

Motivation for use of this equation may be seen as follows:

$$\frac{1}{\Lambda} = \frac{|T|}{|W|} = |W^{-1}(W + B)| = |I + W^{-1}B| = (1 + \lambda_1)(1 + \lambda_2)\cdots(1 + \lambda_r) \tag{29}$$

where $\lambda_1, \lambda_2, \ldots, \lambda_r$ are the nonzero eigenvalues of $W^{-1}B$.
Consequently, Bartlett's V statistic for testing the
significance of an observed value can be expressed as

$$V = -\left[N - 1 - (p + G)/2\right]\ln\Lambda = \left[N - 1 - (p + G)/2\right]\ln\left[(1 + \lambda_1)(1 + \lambda_2)\cdots(1 + \lambda_r)\right] = \left[N - 1 - (p + G)/2\right]\sum_{m=1}^{r}\ln(1 + \lambda_m) \tag{30}$$

This statistic is distributed approximately as chi-square with
p(G-1) degrees of freedom.
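The relation between Wilks' criterion, the eigenvalues, and Bartlett's V can be checked on a small example. The sketch assumes NumPy; the SSCP matrices and the values of N, p, and G are hypothetical, and the helper names are illustrative.

```python
import numpy as np

def wilks_lambda(B, W):
    """Wilks' Lambda = |W| / |T| with T = W + B."""
    return np.linalg.det(W) / np.linalg.det(W + B)

def bartlett_v(eigenvalues, N, p, G):
    """Bartlett's V = [N - 1 - (p + G)/2] * sum of ln(1 + lambda_m)."""
    factor = N - 1 - (p + G) / 2
    return factor * float(np.sum(np.log1p(eigenvalues)))

# Hypothetical SSCP matrices for p = 2 variables, G = 3 groups, N = 30.
W = np.array([[10.0, 0.0], [0.0, 20.0]])
B = np.array([[5.0, 0.0], [0.0, 4.0]])
N, p, G = 30, 2, 3

eigenvalues = np.linalg.eigvals(np.linalg.solve(W, B)).real
Lam = wilks_lambda(B, W)
V = bartlett_v(eigenvalues, N, p, G)
ndf = p * (G - 1)

print("Lambda =", Lam, " V =", V, " n.d.f. =", ndf)
```

The value of V would then be compared with a percentile of the chi-square distribution with `ndf` degrees of freedom, taken from a table or a statistics library.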
Because of the uncorrelatedness of the successive
discriminant functions, the successive terms $\ln(1 + \lambda_m)$
in the last expression above are statistically independent
(assuming multivariate normality of the original p
variables). As a result, the additive components of V are
each approximately distributed as a chi-square variate.
More specifically, the m-th component,

$$V_m = \left[N - 1 - (p + G)/2\right]\ln(1 + \lambda_m) \tag{31}$$

is approximately chi-square with p + G - 2m degrees of
freedom. It may be readily verified that the sum of the
numbers of degrees of freedom (n.d.f.) of the r components,
that is, (p + G - 2) + (p + G - 4) + ... + (p + G - 2r),
is equal to p(G - 1) regardless of whether r = G - 1 or
r = p.

Consequently, when we cumulatively subtract $V_1$, $V_2$,
and so on from V, the remainder each time is also a
chi-square variate; and these successive remainders become
appropriate statistics for testing whether the residual
discrimination after removing the first discriminant
function, the first and second discriminant functions, and
so forth, is statistically significant. The successive
chi-square statistics and their n.d.f.'s may be summarized as
follows:

Residual After                    Approximate
Removing                          chi-square statistic         n.d.f.

First discriminant function       V - V_1                      p(G-1) - (p+G-2) = (p-1)(G-2)

First 2 discriminant functions    V - V_1 - V_2                (p-1)(G-2) - (p+G-4) = (p-2)(G-3)

First 3 discriminant functions    V - V_1 - V_2 - V_3          (p-2)(G-3) - (p+G-6) = (p-3)(G-4)

...

First s discriminant functions    V - V_1 - V_2 - ... - V_s    (p-s)(G-(s+1))
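The successive residual statistics and their degrees of freedom summarized above can be tabulated mechanically. The following sketch uses only the Python standard library; the eigenvalues and the values of N, p, and G are hypothetical, and `residual_tests` is an illustrative helper name.

```python
import math

def residual_tests(eigenvalues, N, p, G):
    """Return rows (s, residual statistic, n.d.f.) for s = 0, 1, ..., r-1.

    s is the number of discriminant functions already removed, so the
    row with s = 0 is the overall statistic V with p(G-1) degrees of
    freedom, and the row with s = 1 is V - V_1 with (p-1)(G-2).
    """
    factor = N - 1 - (p + G) / 2
    components = [factor * math.log(1 + lam) for lam in eigenvalues]
    rows = []
    for s in range(len(components)):
        residual = sum(components[s:])          # V - V_1 - ... - V_s
        ndf = (p - s) * (G - (s + 1))
        rows.append((s, residual, ndf))
    return rows

# Hypothetical eigenvalues of W^{-1}B for p = 2, G = 3, N = 30.
rows = residual_tests([0.5, 0.2], N=30, p=2, G=3)
for s, stat, ndf in rows:
    print(f"after removing {s} function(s): statistic {stat:.3f}, n.d.f. {ndf}")
```

Each statistic would be compared with the chosen percentile of the chi-square distribution with the listed degrees of freedom; the percentile lookup is left out here because it requires a table or a statistics library.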


As soon as the residual after removing the first s
discriminant functions becomes smaller than the prescribed
percentile point (that is, the 100(1 - $\alpha$)th percentile) of
the appropriate chi-square distribution, we may conclude
that only the first s discriminant functions are
significant at that $\alpha$ level. If the number of
significant discriminant functions thus found is smaller than r
(as will often be the case), we will have effected a further
reduction in the dimensionality of the space required to
describe the differences among the G groups from which
our sample groups were drawn. The remaining r-s dimensions
may be regarded as immaterial for population differentiation,
since our sample differences along these dimensions can be
attributed to sampling error.


IV. CLUSTER ANALYSIS


A. ORIGIN AND THEORY 

Clustering is the grouping of similar objects. The
principal functions of clustering are to name, to display,
to summarize, to predict, and to aid in interpretation of
data with many dimensions. Clustering techniques were
first developed in the field of biological taxonomy. Clustering
is one of several methodologies included in the broader
category called classification.

The cluster analysis problem is the last step we
consider in the progression of category sorting problems.
In discriminant analysis some part of the structure
is known and missing information is estimated from labeled
samples; the operational objective there is to
classify new observations, that is, to recognize them as members
of one category or another. In cluster analysis little or
nothing is known about the category structure. All that is
available is a collection of observations whose category
memberships are unknown. We seek to discover a category
structure which fits the observations. The problem may be
stated as one of finding the "natural groups", which means
to sort the observations into groups such that the degree of
"natural association" is high among members of the same
group and low between members of different groups.

Cluster analysis techniques have been applied in many
fields of study. The literature is both voluminous and
diverse, the terminology differing from one field to
another. "Numerical taxonomy" is frequently substituted
for cluster analysis among biologists, botanists, and
ecologists, while some social scientists may refer to "typology".
Other frequently encountered terms are pattern recognition
and partitioning. While discriminant analysis has been
studied by statisticians for nearly 45 years, cluster
analysis has only recently come to statistical notice. Any
method which partitions a set of objects into subsets on
the basis of measurements taken on every object qualifies
as a clustering method.

Most of the well known clustering techniques fall into
one of two main categories: (1) hierarchical and (2)
nonhierarchical (partitioning). The former is one in which
every cluster obtained at any stage is a merger of clusters
at previous stages. The nonhierarchical procedures, however,
form new clusters by lumping and splitting old ones. We
consider both categories shortly.

In a geometric sense, every observation may be viewed 
as a point in p-dimensional Euclidean space. This swarm of 
data points may contain dense regions or "clouds" of data 
points which are separable from other regions containing a 
low density of points. These denser regions constitute what 
are known as clusters. In one- and two-dimensional cases, it
is easy to visualize and to detect the clusters from scatter
plots, assuming that the clusters exist. In higher
dimensions, clustering becomes extremely difficult without
the aid of a computer.

Mathematical clustering techniques usually require a
measure of similarity to be defined for every pairwise
combination of the entities to be clustered. In order to
solve the cluster problem, it is desirable to define the
terms "similarity" and "difference" in a quantitative
fashion. A researcher would assign two observations to
the same group if the distance between them is sufficiently
small, or to different clusters if this distance is
sufficiently large.

At this point, two questions arise. The
first is "how do we measure the distance between the
observations?" and the second is "how small is small
enough, and how large is large enough?" These will be
discussed in the following sections.


B. MEASURES OF DISTANCE
1. General
Let $E_p$ be a symbolic representation for the
p-dimensional measurement space and let X, Y, and Z be
any points in $E_p$. Then any nonnegative real-valued
function D(X,Y) satisfying the following conditions
qualifies as a distance function (or metric):

1. D(X,Y) = 0 if and only if X = Y

2. D(X,Y) $\geq$ 0 for all X and Y in $E_p$

3. D(X,Y) = D(Y,X)

4. D(X,Y) $\leq$ D(X,Z) + D(Y,Z)
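The four metric conditions listed above can be spot-checked on a finite set of points. The sketch below uses only the Python standard library; the function names and the sample points are illustrative, and passing such a check on a few points is evidence rather than a proof.

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def satisfies_metric_conditions(points, dist, tol=1e-12):
    """Check the four conditions on every pair and triple of the given points."""
    for x, y in itertools.product(points, repeat=2):
        d = dist(x, y)
        if d < 0:                              # nonnegativity
            return False
        if (d == 0) != (x == y):               # zero only for identical points
            return False
        if abs(d - dist(y, x)) > tol:          # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(y, z) + tol:   # triangle inequality
            return False
    return True

points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (-2.0, 5.0)]
is_metric_euclid = satisfies_metric_conditions(points, euclidean)
is_metric_squared = satisfies_metric_conditions(points, squared_euclidean)
print(is_metric_euclid, is_metric_squared)
```

The squared Euclidean distance fails the triangle inequality on these points (25 > 2 + 13 for the first three), which is why squared distances are usually called distance measures rather than metrics.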


Many clustering algorithms assume such distances are given and
set about constructing clusters of objects within which the
distances are small. The choice of distance function is no
less important than the choice of variables to be used in
the study. A serious difficulty in choosing a distance lies
in the fact that a clustering structure is more primitive
than a distance function and that knowledge of clusters
changes the choice of distance function. Thus a variable
that distinguishes well between two established clusters
should be given more weight in computing distances than a
"junk" variable that distinguishes badly.


2. Euclidean Distance
The Euclidean distance between the I-th and K-th
observations of a data matrix X is defined as

$$D(I,K) = \Big[\sum_{1 \le J \le p}\big(X(I,J) - X(K,J)\big)^2\Big]^{1/2} \tag{32}$$

where J indexes the variables. In one-, two-, or three-
dimensional space, this is just the "straight line" distance
between the vectors corresponding to the I-th and K-th
observations. When the variables are measured in different
units, it is necessary to prescale the variables to make
their values comparable or, equivalently, to compute a
weighted Euclidean distance

$$D(I,K) = \Big[\sum_{1 \le J \le p} W(J)\big(X(I,J) - X(K,J)\big)^2\Big]^{1/2} \tag{33}$$

This form of distance is not necessary if all variables are
measured on the same scale. However, even in this case,
weights might be used to increase or decrease the importance
of some variables. Various weighting schemes have been
utilized in practice. One common weighting scheme lets
W(J) be the reciprocal of the variance of variable J.
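A short sketch of the unweighted and weighted Euclidean distances follows, assuming NumPy. The data values are made up, the function name is illustrative, and the weights are the reciprocals of the sample variances (divisor n - 1, an assumption).

```python
import numpy as np

def weighted_euclidean(X, i, k, weights=None):
    """Distance between rows i and k of X; unit weights give plain Euclidean."""
    diff = X[i] - X[k]
    if weights is None:
        weights = np.ones_like(diff)
    return float(np.sqrt(np.sum(weights * diff ** 2)))

# Hypothetical data: 3 observations on 2 variables in different units.
X = np.array([[4.0, 36.0],
              [3.0, 52.0],
              [4.0, 44.0]])

# Weights W(J) = 1 / variance of variable J.
w = 1.0 / X.var(axis=0, ddof=1)

d_plain = weighted_euclidean(X, 0, 1)
d_weighted = weighted_euclidean(X, 0, 1, w)
print(d_plain, d_weighted)
```

In the unweighted distance the second variable contributes 256 of the 257 squared units; after weighting by reciprocal variances the two variables contribute 3 and 4 respectively.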

A general class of squared distance functions is
provided by utilizing positive definite quadratic forms.
Specifically, if $\beta$ represents a p-dimensional observation
to be assigned to one of s groups, then to measure the
squared distance between the observation $\beta$ and the
centroid (mean vector) $\bar{x}_i$ of the i-th group one may consider
the function

$$D_i^2 = (\beta - \bar{x}_i)'M(\beta - \bar{x}_i)$$

where M is a positive definite matrix to ensure that
$D_i^2 \geq 0$. Different distance functions are represented by
different choices of the matrix M. When M = I (the
identity matrix) the resulting metric is the standard
Euclidean distance. Distances with the Euclidean metric are
shown in Figure 1a. The variance within the data may make
the unweighted Euclidean metric inappropriate. As shown in
Figure 1b, where X has a larger variance than Y,
one may wish to weight a deviation in the X direction less
than an equal deviation in the Y direction. This is a
weighted Euclidean distance function which makes points A
and B equidistant from the origin. In this case, the
matrix M is diagonal, with elements that are the reciprocals of
the variances of the different variables.

Extending this idea further, it may be possible to
consider the covariance among variables as well. Figure 1c
shows how the axes may be rotated so that the major axis is
oriented in a direction reflecting the positive
correlation between X and Y. Again, points on the same
ellipse are considered equidistant from the origin. The
matrix M in this case is the inverse of the covariance
matrix.

Further extension of this concept leads to a
generalized distance function. If $C_i$ represents the
covariance matrix of the i-th cluster and $\bar{x}_i$ its
centroid, then the distance function

$$ D_i = (x - \bar{x}_i)^{T} C_i^{-1} (x - \bar{x}_i) $$

uses the appropriate covariance structure when determining
the distance to a particular cluster centroid. Since $C_i$
changes to reflect the dispersion internal to each particular
cluster, the use of this metric exploits differences in the
dispersion characteristics of the different groups. As shown
in Figure 1d, note how a new observation (denoted by u) is
closer to the centroid of group one (G1) in terms of
Euclidean distance but is more likely to be assigned to
group two (G2) when using the $C_i$ matrix.

Figure 1. Euclidean Distance (1a. Euclidean measure of squared distance; 1b. Measure of squared distance with different weights for variables; 1c. Generalized squared distance measure; 1d. Classification when within-group dispersions are different)



3. Mahalanobis Distance

Another choice for the M matrix in equation (1)
is $P^{-1}$, where P represents the pooled within-groups
covariance matrix of all the clusters,

$$ P = \frac{1}{N - G}\, W \qquad (34) $$

where

$$ W = \sum_{k=1}^{G} W_k $$

This distance is the well known Mahalanobis distance. Note
that P does not change from group to group. To ensure
the non-singularity of P it must be true that p < (N - G),
where N represents the total number of observations over
all groups. Rewriting the distance,

$$ D_i = (x - \bar{x}_i)^{T} W^{-1} (x - \bar{x}_i) \qquad (35) $$

defines a distance between the vectors $x$ and $\bar{x}_i$ with
common covariance matrix W. The Mahalanobis distance
function adjusts for both the scale of measurement of the
variables and covariation among the variables. Use of this
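metric is equivalent to computing distances on variables
transformed to their principal components. This metric is
invariant under any nonsingular transformation of the original
variables. For consider the transformation

$$ Y_i = B X_i \qquad (36) $$

and let $D(Y_i, Y_j)$ represent the Mahalanobis distance between
$Y_i$ and $Y_j$, where $P_Y = B P B^{T}$ is the covariance matrix
of the transformed variables:

$$
\begin{aligned}
D(Y_i, Y_j) &= (Y_i - Y_j)^{T} P_Y^{-1} (Y_i - Y_j) \\
&= (BX_i - BX_j)^{T} P_Y^{-1} (BX_i - BX_j) \\
&= (X_i - X_j)^{T} B^{T} (B P B^{T})^{-1} B (X_i - X_j) \\
&= (X_i - X_j)^{T} B^{T} (B^{T})^{-1} P^{-1} B^{-1} B (X_i - X_j) \\
&= (X_i - X_j)^{T} P^{-1} (X_i - X_j) \\
&= D(X_i, X_j)
\end{aligned}
$$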
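
The invariance argument can be checked numerically. The following is a minimal sketch for two variables only; the covariance matrix `P`, the transformation `B`, and the two observations are hypothetical values chosen for illustration.

```python
def mat_vec(m, v):
    """Product of a 2 x 2 matrix and a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def mat_mul(a, b):
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(m):
    return [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]

def inverse(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)' cov^{-1} (x - y)."""
    d = [x[0] - y[0], x[1] - y[1]]
    z = mat_vec(inverse(cov), d)
    return d[0] * z[0] + d[1] * z[1]

P = [[4.0, 1.0], [1.0, 2.0]]      # hypothetical covariance matrix
B = [[2.0, 1.0], [0.5, 3.0]]      # hypothetical nonsingular transformation
x_i, x_j = [1.0, 2.0], [3.0, -1.0]

P_y = mat_mul(mat_mul(B, P), transpose(B))          # covariance of Y = BX
d_x = mahalanobis_sq(x_i, x_j, P)
d_y = mahalanobis_sq(mat_vec(B, x_i), mat_vec(B, x_j), P_y)
```

Both `d_x` and `d_y` evaluate to the same squared distance, whereas an unweighted Euclidean distance computed on the transformed observations would in general differ from the one computed on the originals.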


Some other common metrics are listed below:

1. $L_1$ norm (City Block)

$$ D(X_i, X_j) = \sum_{k} |x_{ik} - x_{jk}| $$

2. $L_p$ norm (Minkowski Metrics)

$$ D(X_i, X_j) = \left( \sum_{k} |x_{ik} - x_{jk}|^{p} \right)^{1/p} $$

3. Uniform norm

$$ D(X_i, X_j) = \sup_{k} |x_{ik} - x_{jk}| $$

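
The three metrics listed above can be sketched in a few lines. The function names and the two sample observations are hypothetical; the sketch only illustrates that the city block and uniform norms are the p = 1 and large-p cases of the Minkowski family.

```python
def minkowski(x, y, p):
    """L_p distance between two observations."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def city_block(x, y):
    """L_1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def uniform(x, y):
    """Uniform norm: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical observations on three variables.
x, y = [1.0, 5.0, 2.0], [4.0, 1.0, 2.0]

d_city = city_block(x, y)        # coincides with minkowski(x, y, 1)
d_euclid = minkowski(x, y, 2)    # ordinary Euclidean distance
d_large_p = minkowski(x, y, 50)  # approaches the uniform norm as p grows
d_uniform = uniform(x, y)
```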


C. HIERARCHICAL CLUSTERING 


1. General

The previously discussed distance measures may be
used to construct a similarity matrix describing the strength
of all pairwise relationships among the entities (variables
or data units) in the data set. The methods of hierarchical
cluster analysis operate on this similarity matrix to construct
a tree depicting specified relationships among the
entities. As shown in Figure 2, the branches on the left
each represent one entity while the root represents the
entire collection of entities. Moving down the tree
from the branches toward the root depicts increasing aggregation
of the entities into clusters. Hierarchical
clustering methods which build a tree from branches to
root often are called agglomerative methods.

Once a tree is constructed for N entities, the
analyst may choose from as many as N sets of clusters.
These clusters are nested. From the agglomerative view,
when two entities are merged they are joined together permanently
and considered as one entity for later merges; from
the divisive view, when a group of entities is split into
two parts, the parts are separated permanently and may be
treated independently for the remainder of the analysis.

Figure 2. Tree for Hierarchical Clustering


Herein lie both the strength and weakness of
hierarchical methods: by taking early decisions as permanent,
the number of possibilities that need be examined is
reduced greatly as compared with complete enumeration;
but this same convention precludes discovering early
mistakes or capitalizing on later opportunities.

There are three major hierarchical clustering concepts:

1. Linkage Methods
2. Centroid Methods
3. Error sum of squares or variance methods.

All of these methods are suitable for clustering data units.
However, only the linkage methods are considered in this
research.


2. The General Agglomerative Procedure 


Let $s_{ij}$ be the similarity between entities i
and j as defined by one of the distance measures previously
discussed. Assuming that the similarity is symmetric, the
complete schedule of similarities for all $\binom{N}{2} = \tfrac{1}{2}N(N-1)$
possible pairwise combinations of entities may be arrayed
in a lower triangular similarity matrix as in Figure 3.
The entries $s_{ij}$ are nonnegative. This limitation is of
consequence only for correlation and the cosine of the angle
between vectors; the distinction between positive and negative
association cannot be utilized in these clustering
methods. A simple remedy is to use the absolute value or the square of
the measure if it can assume negative values.

Figure 3. Lower Triangular Similarity Matrix

Once the matrix is defined, the process of clustering entities is
almost trivially simple. The general procedure for agglomerative
clustering on a data matrix is as follows:
(1) Begin with N clusters each consisting of
    exactly one entity. Let the clusters be
    labeled with the numbers 1 through N.

(2) Search the similarity matrix for the most
    similar pair of clusters. Let the chosen
    clusters be labeled p and q and let
    their associated similarity be $s_{pq}$, p > q.

(3) Reduce the number of clusters by 1 through
    merger of clusters p and q. Label the
    product of the merger q and update the
    similarity matrix entries in order to
    reflect the revised similarities between
    cluster q and all other existing clusters.
    Delete the row and column of S pertaining
    to cluster p.

(4) Perform steps 2 and 3 a total of N-1 times
    (at which point all entities will be in one
    cluster). At each step record the identity
    of the clusters which are merged and the value
    of similarity between them in order to have
    a complete record of the results.
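
The four steps above can be sketched as follows. This is an illustrative outline only and is not the program of Appendix A; the function name `agglomerate`, its parameters, the 0-based labels, and the sample matrix are all assumptions made for the sketch.

```python
def agglomerate(sim, update=min, distance_like=True):
    """Merge N entities following steps (1)-(4); return the merge record.

    sim           -- symmetric N x N list of lists (diagonal ignored)
    update        -- rule giving the revised similarity from (s_pr, s_qr)
    distance_like -- True: smallest entry is most similar; False: largest
    """
    n = len(sim)
    s = {(i, j): sim[i][j] for i in range(n) for j in range(i)}  # lower triangle
    active = list(range(n))               # step 1: one cluster per entity
    history = []
    pick = min if distance_like else max
    for _ in range(n - 1):                # step 4: N - 1 merges in all
        # Step 2: most similar pair (p, q) with p > q.
        p, q = pick(((a, b) for a in active for b in active if a > b),
                    key=lambda pair: s[pair])
        history.append((p, q, s[(p, q)]))
        # Step 3: the merged cluster keeps label q; p is deleted.
        active.remove(p)
        for r in active:
            if r != q:
                s_pr = s[(max(p, r), min(p, r))]
                s_qr = s[(max(q, r), min(q, r))]
                s[(max(q, r), min(q, r))] = update(s_pr, s_qr)
    return history

# Hypothetical distance-like similarities among four entities.
dist = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 5.0, 9.0],
    [6.0, 5.0, 0.0, 4.0],
    [10.0, 9.0, 4.0, 0.0],
]
merges_min_update = agglomerate(dist, update=min)
merges_max_update = agglomerate(dist, update=max)
```

Passing a different `update` rule changes only how the revised similarities are computed in step 3, which is the sense in which the different agglomerative methods share one general procedure.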


Different agglomerative methods are implemented by 
varying the procedures used for defining the most similar 
pair at step 2 and for updating the revised similarity 
matrix at step 3. The similarity matrix is a given array 
of numbers. The numerical execution of the clustering 
procedures is completely independent of how the similarity 
values were generated or whether the entities to be 
clustered are variables or data units. However, it is
necessary to make a direct distinction between distance-like
measures (the smallest values correspond to the most similar
pairs) and correlation-like measures (the largest values
correspond to the most similar pairs); the essential
difference is whether the search for the most similar pair
involves seeking the minimum or maximum entry in the similarity
matrix.
3. Single Linkage

The method of single-linkage cluster analysis is
the simplest of all hierarchical techniques. At each stage,
after clusters p and q have been merged, the similarity
between the new cluster (labeled t) and some other cluster r is
determined as follows:

If $s_{ij}$ is a distance-like measure,

$$ s_{tr} = \min(s_{pr}, s_{qr}) \qquad (37) $$

The quantity $s_{tr}$ is the distance between the two closest
members of clusters t and r. If clusters t and r
were to be merged, then for any entity in the resulting
cluster the distance to its nearest neighbor would be at
most $s_{tr}$.

If $s_{ij}$ is a correlation-like measure,

$$ s_{tr} = \max(s_{pr}, s_{qr}) \qquad (38) $$

The quantity $s_{tr}$ is the similarity between the two most
similar entities in clusters t and r. If clusters t
and r were to be merged, then for any entity in the
resulting cluster there would be at least one other entity
in the same cluster such that the pair would have a similarity
at least as large as $s_{tr}$.

The method is known as single linkage because clusters
are joined at each stage by the single shortest or strongest
link between them. Since the updating process involves
choosing only the minimum or maximum, single-linkage clustering
is invariant to any transformation which leaves the
ordering of the similarities unchanged; that is, any monotonic
transformation.


4. Complete Linkage 
The complete-linkage method is related to the single-linkage
method and is no more difficult to execute. At each
stage, after clusters p and q have been merged, the
similarity between the new cluster (labeled t) and some
other cluster r is determined as follows:

If $s_{ij}$ is a distance-like measure,

$$ s_{tr} = \max(s_{pr}, s_{qr}) \qquad (39) $$

The quantity $s_{tr}$ is the distance between the most distant
members of clusters t and r. If clusters t and r
were merged, then every entity in the resulting cluster
would be no farther than $s_{tr}$ from every other entity in
the cluster. The value of $s_{tr}$ is the diameter of the
smallest sphere which can enclose the cluster resulting from
the merger of clusters t and r.

If $s_{ij}$ is a correlation-like measure,

$$ s_{tr} = \min(s_{pr}, s_{qr}) \qquad (40) $$

The quantity $s_{tr}$ is the similarity between the two most
dissimilar entities in clusters t and r. If clusters
t and r were to be merged, then every entity in the
resulting cluster would have a similarity of at least $s_{tr}$
with every other entity in the cluster.

The method is called complete linkage because all
entities in a cluster are linked to each other at some
maximum distance or minimum similarity. Such a cluster
is called a "maximally connected subgraph" in graph theory.
In contrast to the single-linkage method, interpretation
of the clusters can be made only in terms of the relationships
within individual clusters; there is no particularly
useful interpretation involving the differences between
clusters. Like the single-linkage method, complete-linkage
cluster analysis is invariant to monotonic transformations
of the similarity measure. Johnson (1967) discusses this
property in both single and complete linkage methods.
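
The interpretation of the two update rules, and the invariance under monotonic transformations, can be illustrated with a small sketch. The distance matrix and the cluster memberships are hypothetical, and the helper names are not part of any program in the appendices.

```python
from itertools import product

def single_link(dist, t, r):
    """Distance between the two closest members of clusters t and r."""
    return min(dist[i][j] for i, j in product(t, r))

def complete_link(dist, t, r):
    """Distance between the two most distant members of clusters t and r."""
    return max(dist[i][j] for i, j in product(t, r))

# Hypothetical distance-like similarities among four entities.
dist = [
    [0.0, 2.0, 6.0, 10.0],
    [2.0, 0.0, 5.0, 9.0],
    [6.0, 5.0, 0.0, 4.0],
    [10.0, 9.0, 4.0, 0.0],
]
p, q, r = [0], [1], [2, 3]
t = p + q                                   # cluster formed by merging p and q

# Update rules (37) and (39) agree with the direct definitions.
same_single = single_link(dist, t, r) == min(single_link(dist, p, r),
                                             single_link(dist, q, r))
same_complete = complete_link(dist, t, r) == max(complete_link(dist, p, r),
                                                 complete_link(dist, q, r))

# A monotonic transformation (here squaring) changes the values but not
# which pair of entities determines the linkage.
squared = [[d * d for d in row] for row in dist]
monotone_ok = single_link(squared, t, r) == single_link(dist, t, r) ** 2
```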


D. NONHIERARCHICAL CLUSTERING 

Nonhierarchical clustering methods are designed to
cluster data units into a single classification of g
clusters, where g either is specified a priori or is
determined as a part of the clustering method. The central
idea in most of these methods is to choose some initial
partition of the data units and then alter cluster memberships
so as to obtain a better partition. The various
algorithms which have been proposed differ as to what
constitutes a "better partition" and what methods may be used
for achieving improvements.

The broad concept for these methods is very similar 
to that underlying the steepest descent algorithms used 
for unconstrained optimization in nonlinear programming. 
Such algorithms begin with an initial point and then 
converge to a local optimum, moving one step at a time, 
the value of the objective function improving at each step. 

The methods of nonhierarchical clustering typically
may be used with much larger problems than the hierarchical
methods because it is not necessary to calculate and store
the similarity matrix; it is not even necessary to store
the data set. In general, the data units are processed
serially and can be read from tape or disk as needed. This
characteristic makes it possible, at least in principle, to
cluster arbitrarily large collections of data units.

In this research, we consider only the partitioning
method known as "K-MEANS", which was developed by MacQueen
(15). He used the term "K-MEANS" to denote the process of
assigning each data unit to that cluster (of k clusters)
with the nearest centroid (mean vector). The cluster
centroid changes with each transfer of an observation.
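
A serial assignment of this kind can be sketched as below. Seeding the clusters with the first k data units is an assumption of this sketch, as are the function name `one_pass_kmeans` and the sample data.

```python
def one_pass_kmeans(data, k):
    """Assign units serially to the nearest centroid, updating it each time."""
    centroids = [list(row) for row in data[:k]]   # assumed seeds: first k units
    counts = [1] * k
    labels = list(range(k))
    for x in data[k:]:
        d2 = [sum((a - c) ** 2 for a, c in zip(x, cen)) for cen in centroids]
        g = d2.index(min(d2))                     # nearest centroid
        counts[g] += 1
        # Running-mean update: the centroid moves toward its new member.
        centroids[g] = [c + (a - c) / counts[g] for a, c in zip(x, centroids[g])]
        labels.append(g)
    return labels, centroids

# Hypothetical data units on two variables.
data = [[0.0, 0.0], [10.0, 10.0], [1.0, 1.0], [9.0, 11.0], [0.0, 2.0]]
labels, centroids = one_pass_kmeans(data, 2)
```

Because each centroid is revised as soon as a unit joins its cluster, only the current centroids and counts need to be held in memory, which is why the data units can be read serially.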

The decomposition of the total scatter matrix into
within and between groups matrices suggests possible
optimality criteria to be used in a clustering algorithm.
One would like the within-groups scatter to be small
relative to the between-groups scatter. Various trial
clusterings could be formed using the W and B matrices
as a basis for the optimality criteria which determine the
best clustering. A possible choice for a criterion is to
minimize trace W over all partitions into g groups.
Since T is constant over all partitions, minimizing trace
W is equivalent to maximizing trace B since

trace T = trace W + trace B                              (41)

Although trace W is invariant under an orthogonal
transformation, it is not invariant under other non-singular
linear transformations.
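
Equation (41) can be verified on a small example. In the sketch below the traces are computed directly as sums of squared deviations; the two partitions and all names are hypothetical.

```python
def mean(rows):
    """Column means of a list of rows."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def trace_criteria(groups):
    """Return (trace T, trace W, trace B) for a partition given as lists of rows."""
    all_rows = [r for g in groups for r in g]
    grand = mean(all_rows)
    trace_t = sum(sq_dist(r, grand) for r in all_rows)
    trace_w = sum(sq_dist(r, mean(g)) for g in groups for r in g)
    trace_b = sum(len(g) * sq_dist(mean(g), grand) for g in groups)
    return trace_t, trace_w, trace_b

# Two hypothetical partitions of the same four data units.
good = [[[0.0, 0.0], [1.0, 1.0]], [[9.0, 11.0], [10.0, 10.0]]]
poor = [[[0.0, 0.0], [10.0, 10.0]], [[1.0, 1.0], [9.0, 11.0]]]

t_good, w_good, b_good = trace_criteria(good)
t_poor, w_poor, b_poor = trace_criteria(poor)
```

Trace T is the same for both partitions, so the partition with the smaller trace W necessarily has the larger trace B.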

McRae (16) points out that trace W equals the total
within group sum of squares, hence the "minimum variance
partition" cluster solution is found by minimizing trace W.

Considerable study has been devoted to alternative
criteria such as those based on multivariate statistical
analysis techniques, especially the methods of linear
discriminant analysis and multivariate analysis of variance.
Assuming the p variables are not linearly dependent, then
as long as $p \le N - g$, W is positive definite symmetric
and so is $W^{-1}$. Attempts to make B and W as different
as possible lead one to solving the determinantal equation:

$$ |B - \lambda W| = 0 \qquad (42) $$

The solutions $\lambda_i$ are the eigenvalues of the matrix $W^{-1}B$
as in discriminant analysis. There are t non-zero
eigenvalues, where t is the minimum of p and g-1.
This is a consequence of the fact that, if g is less than
p, the g group means are contained in a (g-1)-dimensional
hyperplane. When g = 2 the analysis is equivalent to
two-group discriminant analysis. Linear discriminant
analysis would take the vectors originally described in a
p-dimensional coordinate system and transform the basis to a
t-dimensional system. Maximizing the largest of these
eigenvalues is a criterion suggested by S.N. Roy, while
maximizing the trace of $W^{-1}B$ is a criterion
suggested by Hotelling. In both cases, large values for
these statistics are sought in clustering algorithms since
large values indicate large differences among (between)
groups. Minimizing the ratio of determinants $|W| / |T|$
is a criterion widely known as Wilks' lambda, discussed in
the discriminant analysis. Since T is the same for all
partitions, this criterion is equivalent to minimizing
determinant W. Both trace $W^{-1}B$ and $|T| / |W|$ may
be expressed in terms of the eigenvalues of $W^{-1}B$:

$$ \frac{|T|}{|W|} = \prod_{i=1}^{t} (1 + \lambda_i) \qquad (43) $$

$$ \mathrm{trace}\, W^{-1}B = \sum_{i=1}^{t} \lambda_i \qquad (44) $$

where t = min(p, g-1). Therefore minimizing det W is
equivalent to maximizing $\prod (1 + \lambda_i)$.

Friedman and Rubin (6) describe the advantages of the
various criteria. Those based on multivariate statistical
considerations (all but trace W) are invariant under
changes in scale for variables (non-singular linear
transformation). In fact, they are the only invariants for W
and B under such transformations. In addition, the
multivariate criteria may take into account covariation
among the variables.
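
The relations among the criteria in equations (42) through (44) can be checked on a small case. The sketch below is restricted to p = 2 variables so that the eigenvalues follow from a quadratic; the data, with g = 3 groups, are hypothetical.

```python
import math

def mean(rows):
    return [sum(r[0] for r in rows) / len(rows), sum(r[1] for r in rows) / len(rows)]

def scatter(rows, center):
    """2 x 2 sum of outer products of deviations from center."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - center[0], r[1] - center[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

groups = [  # hypothetical data: g = 3 groups, p = 2 variables
    [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]],
    [[6.0, 7.0], [7.0, 9.0], [8.0, 8.0]],
    [[1.0, 8.0], [2.0, 9.0], [3.0, 7.0]],
]
all_rows = [r for g in groups for r in g]
T = scatter(all_rows, mean(all_rows))
W = [[0.0, 0.0], [0.0, 0.0]]
for g in groups:
    s_g = scatter(g, mean(g))
    W = [[W[a][b] + s_g[a][b] for b in range(2)] for a in range(2)]
B = [[T[a][b] - W[a][b] for b in range(2)] for a in range(2)]

# Eigenvalues of W^{-1}B from its characteristic polynomial
# lambda^2 - tr * lambda + dm = 0 (valid for the 2 x 2 case only).
dw = det(W)
W_inv = [[W[1][1] / dw, -W[0][1] / dw], [-W[1][0] / dw, W[0][0] / dw]]
M = [[sum(W_inv[a][k] * B[k][b] for k in range(2)) for b in range(2)]
     for a in range(2)]
tr, dm = M[0][0] + M[1][1], det(M)
root = math.sqrt(tr * tr - 4.0 * dm)
lams = [(tr + root) / 2.0, (tr - root) / 2.0]

wilks_lambda = det(W) / det(T)          # |W| / |T|
hotelling_trace = sum(lams)             # trace of W^{-1}B, equation (44)
roy_largest_root = max(lams)
product_form = (1.0 + lams[0]) * (1.0 + lams[1])   # equation (43)
```

Here `product_form` equals `1 / wilks_lambda`, so a partition that minimizes Wilks' lambda is one that maximizes the product in equation (43).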
V. ANALYSIS OF MULTIVARIATE UTILITY DATA


To illustrate hierarchical clustering we applied the
technique described in the previous chapter to partition a
set of twenty-six attributes of a close-air support weapon
system into a smaller collection of "superattributes". As
part of an effort to evaluate the military utility of a
proposed alternative U.S. Marine Corps air support radar
system, the AN-TPQ/27, Barr and Richards (4) extracted 26
attributes of the TPQ/27 and a baseline system, the AN-TPQ/10,
and then had members of the Operational Test and Evaluation
Team assess the utility of the TPQ/27 relative to that of the
TPQ/10. In order that the additive model used to combine
unidimensional relative utilities into a system relative
utility be justifiable, it is necessary that the utilities
satisfy certain independence properties described in Keeney
and Raiffa (12).

Because those independence properties are very diffi- 
cult for decision makers to verify for complex alternatives 
like the weapon systems under study, Professors Barr and 
Richards attempted instead to work with the attributes to 
try to generate a new collection which would likely satisfy, 
at least approximately, the conditions required to justify 
the additive model. 


The original collection of 26 attributes is as follows:

1. Portability
2. Durability
3. Time to Set Up
4. Time to Take Down
5. Ease of Assigning Aircraft to Targets
6. Number of Aircraft Controlled
7. Number of Targets
8. Communications
9. Mission Flexibility
10. ASRT Survivability
11. Time to Locate and Acquire Aircraft
12. Accuracy of Tracking
13. Accuracy of Delivery
14. Range
15. Aircraft Vulnerability
16. Aircraft Attack Throughput
17. Ease of Adjustment and Evaluation of Results
18. Accuracy of Feedback
19. Ease of Operation
20. Man-Machine Compatibility
21. Training Requirements
22. Reliability
23. Maintainability
24. Supportability
25. Availability
26. Documentation

It is clear from observing the above collection that
some of the attributes are highly correlated and redundant.
If one tries to assign an importance weight to each
attribute separately, there is a distinct likelihood that
some of the strongly overlapping, interrelated attributes
might effectively be double or triple weighted or more,
producing a biased result. In an effort to prevent this
from happening, Barr and Richards asked the utility assessment
team to partition the 26 attributes into a smaller
collection in such a way that attributes within a group are
similar and attributes in different groups are unrelated in the
sense that utility assessments for attributes in one group
do not depend on the amounts of attributes in any other
group.

The total number of groups was not prespecified.
Instead, each team member was allowed to partition the 26
attributes into any number of groups. The resulting multivariate
data array is shown in Table II. An element $x_{ij}$
is the number of the group into which team member j put
attribute i.

Table II. Data Matrix

Let us define a distance measure for this data array
as follows:

$$ D(a_i, a_k) = \sum_{j} \bigl(1 - I(x_{ij}, x_{kj})\bigr) \qquad (45) $$

where $a_i$ represents attribute i and

$$ I(x_{ij}, x_{kj}) = \begin{cases} 1 & \text{if } x_{ij} = x_{kj} \\ 0 & \text{if } x_{ij} \ne x_{kj} \end{cases} $$

It is easy to verify that D is a metric as defined in
Chapter IV. Since we will actually work with a similarity
measure in the hierarchical cluster procedure, we define
the similarity between two attributes $a_i$ and $a_k$ as

$$ S(a_i, a_k) = \sum_{j} I(x_{ij}, x_{kj}) \qquad (46) $$

One can see from this definition that the similarity between
two attributes $a_i$ and $a_k$ is simply the number of team
members who placed attributes $a_i$ and $a_k$ in the same
group. For example, for one pair of attributes,

$$ S = 0 + 1 + 0 + 1 + 0 + 0 + 0 + \cdots + 1 + 0 = 4 $$

Either S or D can be used in the computer program
shown in Appendix A for hierarchical clustering. One need
only indicate whether a correlation-like measure (larger
values imply more similar) or a distance-like
measure (smaller values imply more similar) is wanted. We elected to
use the former. The similarity matrix extracted
from the data is shown in Table III. We present only the lower
triangular elements since $S(a_i, a_i)$ is the same for every i and
the matrix is symmetric; i.e., $S(a_i, a_k) = S(a_k, a_i)$.
Zero values are not written.

Table III. Similarity Matrix for Superattribute Determination

The results from the hierarchical clustering are shown
in Figure 4. The numbers printed along the left hand
margin refer to the attribute numbers. As you proceed to
the right through the tree you will observe numbers greater
than 26. These correspond to the clusterings that take
place from one step to the next. For example, the number
27 shown at the juncture of 25 and 22 means that the first
attributes clustered together should be 25 and 22 (this is
the most similar pair). This combination is then considered
as a new attribute which is later combined with the attribute
30 (itself a combination of 23 and 24) to form the attribute
31. This is later combined with attribute 2 to form
attribute 40, etc.

Figure 4. Tree for 26 Attributes

As discussed in Chapter IV a decision has to be made as
to how many clusters (superattributes) are desired. All
hierarchical methods will continue clustering until there
is a single cluster. In order to decide on the number of
clusters (and their composition) one need only imagine drawing
a vertical line through the tree at various places. Each
intersection of the tree with the vertical line results in a
cluster. For example, the vertical line at the point A
results in the 6 clusters shown in Table IV.
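
The distance and similarity measures of equations (45) and (46) can be sketched as follows. The small array below is hypothetical and is not the data of the utility study; the function names are likewise illustrative and do not come from Appendix A.

```python
def similarity(x, i, k):
    """S(a_i, a_k): number of team members who put attributes i and k together."""
    return sum(1 for a, b in zip(x[i], x[k]) if a == b)

def distance(x, i, k):
    """D(a_i, a_k): number of team members who separated attributes i and k."""
    return len(x[i]) - similarity(x, i, k)

# Hypothetical array: rows are attributes, columns are team members,
# and each entry is the group number that member gave the attribute.
x = [
    [1, 1, 2, 1],
    [1, 2, 2, 1],
    [3, 2, 1, 2],
]
n_attributes = len(x)
lower_triangle = {(i, k): similarity(x, i, k)
                  for i in range(n_attributes) for k in range(i)}
```

Since S and D always sum to the number of team members, either one carries the same information, which is consistent with the remark that either may be supplied to the clustering program.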

The superattributes used in the utility study are those
shown in Table IV. A careful examination of the attributes
which comprise the clusters shows that the results so obtained
are intuitively agreeable. The names supplied to the
superattributes are somewhat natural descriptions of the
clusters obtained.

The listing of the computer program and sample output
are given in Appendix A.

Table IV. Superattributes

Superattributes          Component Attributes

Facility of movement     Portability
                         Time to set up
                         Time to take down

Facility of Use          Ease of assigning aircraft to targets
(precision)              Number of aircraft controlled
                         Number of targets
                         Mission flexibility
                         Time to locate and acquire aircraft
                         Aircraft attack throughput
                         Ease of adjustment
                         Accuracy of feedback
                         Ease of operation
                         Man-machine compatibility
                         Accuracy of tracking
                         Accuracy of delivery
                         Range

                         ASRT Survivability
                         Aircraft vulnerability

                         Training requirements
                         Documentation

                         Durability
                         Reliability
                         Maintainability
                         Supportability
                         Availability

VI. ANALYSIS OF ARMY TANK DATA


A. DATA STRUCTURE

In order to illustrate the nonhierarchical clustering
methodology, principal components analysis, and discriminant
analysis, data on Army tanks from eight different countries
were taken from Jane's Book of Weapon Systems (1979-80).
A total of twenty-four tanks were included in the data
array with observations on each of 10 variables. The 10
variables are listed below:



1. Weight (ton)
2. Length (meter)
3. Width (meter)
4. Height (meter)
5. Road Speed (kilometer per hour)
6. Trench Crossing (meter)
7. Ground Pressure (kg/cm²)
8. Maximum Armament (rounds)
9. Ground Clearance (meter)
10. Power to Engine Ratio (BHP/ton)


The twenty-four tanks and the associated countries are
shown below:

Identification Number   Type/Name          Country
11                      T-62               U.S.S.R.
12                      T-54               U.S.S.R.
13                      T-10               U.S.S.R.
14                      ASU-85             U.S.S.R.
15                      MK-5/Chieftain     U.K.
16                      MK-3/Vickers       U.K.
17                      MK-13/Centurion    U.K.
18                      CVR(T)/Scorpion    U.K.
19                      XM-1               U.S.A.
20                      M60A2              U.S.A.
21                      M60                U.S.A.
22                      M48                U.S.A.
23                      M47                U.S.A.
24                      PZ61               SWITZERLAND
25                      PZ68               SWITZERLAND
26                      STRV-103           SWEDEN
27                      Ikv-91             SWEDEN
28                      TYPE61             JAPAN
29                      TYPE74             JAPAN
30                      Leopard 2          W. GERMANY
31                      Leopard            W. GERMANY
32                      TAM                W. GERMANY
33                      AMX 30             FRANCE
34                      AMX 13             FRANCE

We conjecture that a cluster analysis of the tank data will
result in clusters corresponding to nationality since the
nations may place different emphasis on the variables in
the design of their tanks.


B. NONHIERARCHICAL CLUSTER ANALYSIS OF TANK DATA

1. The MIKCA Algorithm

The specific algorithm chosen for the nonhierarchical
cluster analysis of the tank data is the MIKCA (Multivariate
Iterative K-MEANS Clustering Algorithm) program written by
Douglas J. McRae as a part of his doctoral dissertation
at the University of North Carolina, Chapel Hill.

Reference to the flow chart in Figure 5 will aid the
reader in following the discussion of the algorithm. Inputs to
the program are the data matrix, an estimate for g (the
number of clusters), and the choice of criterion and distance
functions.

In the first step, preliminary calculations are made,
such as the variable means and standard deviations, as
well as the cross product matrix T. The next step forms
the initial cluster centers. Then each of the other
observations is assigned to the nearest cluster. Euclidean
distance is used for this initial phase, and the cluster
centroids are recomputed after each observation is assigned
to a group. The observations are considered in the same
order as they were input. After all of them have been
assigned to clusters, the criterion value is computed.


This initial cluster-finding technique is referred to as a
one-pass K-MEANS procedure. It is performed three times,
and the solution which yields the best criterion value is
chosen as the initial cluster solution.

Figure 5. MIKCA Flow Chart (boxes: preliminary calculations; initial cluster solution; K-MEANS; criterion improvement? yes/no; individual switches)

After the initial solution has been found, the program
advances to the iterative K-MEANS phase where the observations
are again considered in the order in which they were input to
the program. It is in this phase that the user's choice of
distance function is used. The distance from each observation
to each cluster centroid is again computed, this time
with the user's distance function, the assignment to the
closest centroid being made and the centroid updated to
reflect its new membership. After considering all n
observations in this manner, the new criterion value is
checked for possible improvement during the K-MEANS iteration.
As long as the criterion value improves, the K-MEANS
procedure is repeated; if the criterion fails to improve then
the MIKCA algorithm goes to the next step, the individual
switches section. Note the importance of the order of
consideration of the observations. The order is important
because the cluster means are recomputed after each observation
is reassigned.

In the individual switches phase, consideration is given
to moving each observation to every other cluster, the move
being made if and only if an improvement in the value of
the criterion results. An elaborate labelling procedure
provides a unique order in which to consider each observation.
This procedure continues until a complete pass through the
data is made with no changes in cluster membership.
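
The individual switches phase can be outlined as below under the minimum trace W criterion. This is a simplified sketch and not the MIKCA code of Appendix B: observations are visited in input order instead of by the labelling procedure mentioned above, and the data, starting labels, and function names are hypothetical.

```python
def trace_w(data, labels, g):
    """Total within-group sum of squares (trace W) for a labelling."""
    total = 0.0
    for c in range(g):
        members = [x for x, lab in zip(data, labels) if lab == c]
        if not members:
            continue                      # an emptied cluster contributes nothing
        cen = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - m) ** 2 for x in members for a, m in zip(x, cen))
    return total

def individual_switches(data, labels, g):
    """Move single observations between clusters while trace W improves."""
    labels = list(labels)
    best = trace_w(data, labels, g)
    changed = True
    while changed:                        # stop after a pass with no changes
        changed = False
        for i in range(len(data)):
            for c in range(g):
                if c == labels[i]:
                    continue
                trial = labels[:i] + [c] + labels[i + 1:]
                value = trace_w(data, trial, g)
                if value < best - 1e-12:  # accept only a strict improvement
                    labels, best, changed = trial, value, True
    return labels, best

# Hypothetical data; the last observation starts in the wrong cluster.
data = [[0.0, 0.0], [1.0, 1.0], [9.0, 11.0], [10.0, 10.0], [0.0, 2.0]]
start = [0, 0, 1, 1, 1]
final_labels, final_value = individual_switches(data, start, 2)
```

Recomputing the criterion from scratch for every trial move keeps the sketch short; a production program would update the criterion incrementally.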

The MIKCA algorithm provides the following options for
distance and criterion functions.

Criterion

1. Minimum trace W
2. Minimum determinant W
3. Maximum largest root of |B - λW| = 0
4. Maximum sum of roots of |B - λW| = 0

Distance

1. Euclidean
2. Weighted Euclidean
3. Mahalanobis

A complete computer program is listed in Appendix B.


2. Cluster Results for Tank Data

For clustering of the tank data we selected the
minimum trace W criterion and the weighted Euclidean
distance function. The algorithm automatically provides
weights for the weighted Euclidean distance function.
The results of the clustering with four clusters are shown
in Table V.

The conjecture of clustering by nationalities is
supported by the results. The three Soviet tanks make up
one cluster and the two British and four of the United
States tanks were found to be similar. A third cluster
consists of four tanks which are very lightweight. The
final cluster consists of the rest of the tanks, including
tanks of United States allies from West Germany, France,
Sweden, Switzerland and Japan.

A natural question to ask after observing the results
of a cluster analysis is what variables most strongly
influence the clustering that was observed. A clue is
provided by the composition of the cluster containing all of
the lightweight tanks. This suggests that weight is an
important distinguishing feature. This is examined in the
principal components analysis and the discriminant analysis
in the next two sections.


C. PRINCIPAL COMPONENTS ANALYSIS 

The Statistical Package for the Social Sciences (SPSS)
(14) subprogram FACTOR was used for the principal components
analysis. It is designed both for factor analysis and
principal components analysis. The outputs are designed
to be self-explanatory. In this example, the first 5
components account for 90% of the variance and the remaining
components account for only 10% of the variance (Figure 6).

The subprogram FACTOR provides a graphical presentation
(Figure 7) for the factors that have been determined by the
orthogonal rotations (in this example, variance maximization
rotation). In reading the graphs, one should be attentive
to the following three features: (1) the relative distance of
a variable from the axis, (2) the direction of a variable
in relation to the axis, and finally (3) clustering of
variables and their relative position to each other.
For example, variables 5 (road speed) and 10 (power to
engine ratio) contribute heavily to the first principal
component while variables 1 (weight) and 3 (width)
contribute most strongly to the second principal component.
Variables 2, 4, 6, 7, 8, 9 are not as important. The
weights accorded each variable in the 10 factors (principal
components) are shown in Figure 8. The complete SPSS
program is listed in Appendix C.


D. DISCRIMINANT ANALYSIS 

The SPSS subprogram DISCRIMINANT was used to determine
that function or those functions of the 10 variables that
best discriminate among the four clusters determined in
the previous section.

The maximum number of discriminant functions to be
derived is either one less than the number of groups or equal
to the number of discriminating variables, whichever is
smaller. This subprogram
provides two measures for judging the importance of
discriminant functions. One of these is the relative
percentage of the eigenvalue associated with the function.
It is a measure of the relative importance of the function.
The sum of the eigenvalues is a measure of total variance
existing in the discriminating variables. Since discriminant
functions are derived in order of their importance, this
process can be stopped whenever the relative percentage is
judged to be too small. Of course, there is no fixed rule
for deciding what is too small. In this research, we selected,
arbitrarily, a significance level of 0.10. The output shown
in Figure 9 suggests that we therefore consider only the
first two discriminant functions.

The second measure for judging the importance of a
discriminant function is its associated canonical correlation. 
The canonical correlation is a measure of association between 
the single discriminant function and the set of (g-1) dummy 
variables which define the g group memberships. It tells 
us how closely the function and the group variable are 
related, which is just another measure of the function's 
ability to discriminate among the groups. From Figure 10, 
the first two discriminant functions are each highly corre- 
lated with the groups but the third has only a moderate 
correlation. 
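The two measures described above can be illustrated with a short sketch. It assumes the standard relation between a discriminant function's eigenvalue and its canonical correlation, r = sqrt(eigenvalue / (1 + eigenvalue)). The eigenvalues are hypothetical, chosen only to mimic the pattern reported for Figures 9 and 10, and the function names are ours rather than part of SPSS.

```python
import math


def max_discriminant_functions(n_groups, n_variables):
    """The smaller of (groups - 1) and the number of discriminating variables."""
    return min(n_groups - 1, n_variables)


def relative_percentages(eigenvalues):
    """Each eigenvalue as a percentage of the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


def canonical_correlation(eigenvalue):
    """Canonical correlation implied by a discriminant eigenvalue."""
    return math.sqrt(eigenvalue / (1.0 + eigenvalue))


# Hypothetical eigenvalues for three functions (NOT taken from Figure 9).
hypothetical_eigenvalues = [9.0, 3.0, 0.5]

if __name__ == "__main__":
    print("maximum functions for 4 groups, 10 variables:",
          max_discriminant_functions(4, 10))
    percentages = relative_percentages(hypothetical_eigenvalues)
    for number, (value, pct) in enumerate(
            zip(hypothetical_eigenvalues, percentages), start=1):
        print(f"function {number}: relative percentage {pct:5.1f}, "
              f"canonical correlation {canonical_correlation(value):.3f}")
```

In this made-up case the third function carries a small relative percentage and only a moderate canonical correlation, which is the kind of evidence used in the text to set it aside.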

The next criterion for eliminating discriminant 
functions is to test for the statistical significance of 
discriminating information not already accounted for by 
the earlier functions. As each function is derived, 
starting with no (zero) functions, Wilks' lambda is computed. 
Lambda is an inverse measure of the discriminating power in
the original variables which has not yet been removed by the
discriminant functions - the larger lambda is, the less 
is the information remaining. Lambda can be transformed into 
a chi-square statistic for an easy test of statistical 
significance. In Figure 9, Wilks' lambda was .594 after 
the first two functions had been derived. This corresponds
to a chi-square of 8.8476 with a probability level of .1823.

Figure 10.





This means that a lambda of this magnitude or smaller has a
.1823 probability of occurring due to the chances of
sampling even if there was no further information to be
accounted for by a third function in the population.
Clearly, a third function is not statistically significant
in this case.
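The transformation of lambda into a chi-square can be sketched as below, assuming Bartlett's approximation, chi-square = -(n - 1 - (p + g)/2) ln(lambda), with (p - k)(g - k - 1) degrees of freedom after k functions have been derived. The text does not say how many variables remained in the functions. The value p = 8 used here is purely an assumption, adopted because it comes close to the reported figures. The values n = 24 and g = 4 come from the data description, and the function names are ours.

```python
import math


def bartlett_chi_square(wilks_lambda, n_cases, n_vars, n_groups):
    """Bartlett's approximation: -(n - 1 - (p + g)/2) * ln(lambda)."""
    multiplier = n_cases - 1 - (n_vars + n_groups) / 2.0
    return -multiplier * math.log(wilks_lambda)


def residual_df(n_vars, n_groups, n_derived):
    """Degrees of freedom after n_derived functions: (p - k)(g - k - 1)."""
    return (n_vars - n_derived) * (n_groups - n_derived - 1)


def chi_square_upper_tail(x, df):
    """Upper-tail probability of chi-square; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch handles positive even df only")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    # n = 24 tanks and g = 4 groups are from the text; p = 8 is an ASSUMPTION.
    chi2 = bartlett_chi_square(0.594, n_cases=24, n_vars=8, n_groups=4)
    df = residual_df(n_vars=8, n_groups=4, n_derived=2)
    print(f"approximate chi-square: {chi2:.3f} on {df} degrees of freedom")
    # Probability for the chi-square value reported in the text.
    print(f"probability for 8.8476: {chi_square_upper_tail(8.8476, df):.4f}")
```

Because lambda is reported to only three decimal places, the chi-square recomputed this way differs slightly from the printed 8.8476. The probability for the reported chi-square on 6 degrees of freedom is about .18, which is above the 0.10 level chosen earlier.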

The standardized discriminant function coefficients
corresponding to the values of the Vij's discussed in the
previous section are used to compute the discriminant score
for a case (observation) in which the original discriminating
variables are in standard form. The discriminant score is
computed by multiplying each discriminating variable by its
corresponding coefficient and adding together these products.
There is a separate score for each observation on each
function. The coefficients have been derived in such a way
that the discriminant scores produced are in standard form.

When the sign is ignored, each standardized discriminant
function coefficient represents the relative contribution of
its associated variable to that function. The sign merely
denotes whether the variable is making a positive or negative
contribution.
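The computation of a discriminant score, and the reading of coefficients by absolute size, can be sketched as follows. The coefficients and standardized values are hypothetical and limited to three variables for brevity. They are not the coefficients produced by the SPSS run, and the function names are ours.

```python
def discriminant_score(coefficients, standardized_values):
    """Sum of coefficient times standardized variable, for one case."""
    if len(coefficients) != len(standardized_values):
        raise ValueError("one coefficient is needed per variable")
    return sum(c * z for c, z in zip(coefficients, standardized_values))


def rank_by_contribution(names, coefficients):
    """Order variables by the absolute size of their coefficients."""
    pairs = zip(names, coefficients)
    return sorted(pairs, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical standardized coefficients for one function (illustration only).
names = ["weight", "width", "road speed"]
coefficients = [-0.9, 0.3, 0.5]

# Hypothetical standardized measurements for a single tank.
case = [1.2, -0.4, 0.8]

if __name__ == "__main__":
    print(f"discriminant score: {discriminant_score(coefficients, case):.2f}")
    for name, coefficient in rank_by_contribution(names, coefficients):
        direction = "positive" if coefficient >= 0 else "negative"
        print(f"{name}: |{abs(coefficient):.1f}|, {direction} contribution")
```

Here the variable with the largest absolute coefficient contributes most to the function, and its negative sign only indicates the direction of that contribution.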

A graphical presentation is shown in Figure 11 using
the first and the second canonical discriminant functions as
the axes. From this scatterplot, we can easily see that
Soviet tanks (labelled 1) are well distinguished from
all of the others using only the first two discriminant
functions. Also, all the lightweight tanks are clearly
separated from the others. Groups 2 and 3 are also distinct,
though they are not separated from each other as much as
they are from groups 1 and 4. The complete SPSS program
for the discriminant analysis is listed in Appendix D.


VII. CONCLUSION


The multivariate analysis techniques of cluster analysis, 
principal components analysis and discriminant analysis are 
useful in real-world problems for examining observations
on each of several dimensions. Each of the techniques is
related mathematically to the others, and each complements
the others in explaining the data.

Computer software is readily available from many sources.
The software used in this thesis for hierarchical clustering, 
principal components analysis, and discriminant analysis was 
from the IMSL package and SPSS. For nonhierarchical 
clustering, we used the FORTRAN program developed by McRae 
(16). All of this software is readily available and
documented at the Naval Postgraduate School.